I am excited to have enrolled in Open Data Science, and I look forward to learning new data science skills during the course.
In this exercise, a student feedback dataset (more information here) collected during 2014-2015 was used to assess the relationship between learning approaches and students' achievement in an introductory statistics course. The dataset underwent data wrangling and is used here to fit a linear regression model and interpret the results.
The first step is to set the working directory to the folder where the data is stored.
#set the working directory of your R session the IODS project folder
setwd("C:/LocalData/ocholla/IODS-project")
We load the dataset saved during the data wrangling exercise. We read the file with header = TRUE and stringsAsFactors = TRUE so that categorical variables are stored as factors rather than character strings.
students2014<-read.csv("Data/learning2014.csv", header = TRUE, stringsAsFactors = TRUE)
We explore the data structure and its dimensions
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.5 3.1 3.5 3.5 3.7 4.7 3.7 3.1 4.3 4 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
The data contains 166 observations of 7 variables. The observations refer to the students and the variables are gender (male or female), age (derived from date of birth), exam points, attitude, deep learning (deep), strategic learning (stra) and surface learning (surf).
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.600 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.300 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.700 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.666 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.100 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.900 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The mean age of the students is 25.5 years, with the youngest being 17 and the oldest 55. The attitude variable is scaled from 1 to 5, with 5 indicating the most positive attitude towards statistics.
Female students (110) were almost twice as many as male students (56), and deep learning has the highest mean among the three learning-approach variables. The lowest exam score was 7 and the class average was 22.7 points.
library(ggplot2)
#initialize plot with data and aesthetic mapping
p1<- ggplot(students2014, aes( x= attitude, y= points, col = gender))
#define the visualization type (points)
p2<- p1 + geom_point()
#add a regression line
p3<- p2+geom_smooth(method = "lm")
#add a main title and draw the plot
p4<- p3 + ggtitle("student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'
From the plot above, we can see that the majority of students of both genders had an average attitude, which coincided with average exam points. Female students with a negative attitude towards statistics still attained higher exam points than male students with a similarly negative attitude.
#Access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#create a more advanced plot matrix with ggpairs()
p<- ggpairs(students2014, mapping = aes (col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
#draw the plot
p
From the plot matrix, attitude is the variable most strongly correlated with exam points in both genders, while male students tend to score lower than female students on deep learning.
A linear regression model is a statistical approach used for modelling the relationship between a dependent variable \(Y\) and one or more explanatory variables \(X\). If there is only one explanatory variable it is called simple linear regression; with more than one, it is referred to as multiple linear regression.
The linear model is divided into two parts, the systematic part and the stochastic part, and follows the equation \[Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]
Using the student dataset, exam points is the target variable while attitude, strategic learning (stra) and age are the explanatory variables in the model. The three predictors were selected based on their correlation with the response variable.
my_model<- lm(points~ attitude+stra+age, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The Call section captures the fitted model: in this case a multiple linear regression with points as the target variable and attitude, strategic learning (stra) and age as explanatory variables.
Residuals
Residuals are the differences between the actual observed response values (exam points) and the response values that the model predicted.
The residual section gives a five-number summary, used to assess whether the residuals are symmetrically distributed around a mean of zero. From the output, the distribution of the residuals appears fairly symmetrical.
Coefficients
The coefficients are the estimated constants of the model: the intercept and the slope parameters. The coefficient section contains four columns: Estimate, Std. Error, t value and Pr(>|t|).
Estimate
The Estimate column contains four rows; the first is the intercept \(\alpha\), while the rest are estimates of the unknown parameters \(\beta_1, \beta_2, \dots, \beta_k\) in our multiple linear regression model.
Coefficient - standard error
The standard error measures the average amount that the coefficient estimate would vary from its true value under repeated sampling. Ideally, this value should be small relative to the coefficient estimate itself.
Coefficient - t value
The t value measures how many standard errors the coefficient estimate is away from 0. A t value far from zero indicates that we can reject the null hypothesis, i.e. we declare that a relationship between the explanatory variable and the target variable exists. In our analysis, the t statistic of attitude is far from zero and large relative to its standard error, indicating a relationship, whereas the evidence for stra and age is weaker.
Coefficient - Pr(>|t|)
Commonly referred to as the p-value, a small value indicates that the observed relationship between the predictor and the target variable is unlikely to be due to chance. Typically, a p-value of 5% or less is a good cut-off point. In our model, the p-values for the intercept and attitude are much closer to zero than those for stra and age.
The “signif. codes” associated with each estimate indicate its significance level; three stars (asterisks) represent a highly significant p-value. Only attitude is significant at the 5% level (p \(\leqslant\) 0.05), so we can reject the null hypothesis for it and conclude that there is a relationship between exam points and attitude; stra (p = 0.062) and age (p = 0.098) are significant only at the 10% level.
Residual standard error is a measure of the quality of the linear regression fit: the average amount that the response (exam points) deviates from the true regression line. In my_model the response deviates from the regression line by approximately 5.26 points on average. Note that the residual standard error was calculated with 162 degrees of freedom (DF), the number of data points that went into the estimation after accounting for the estimated parameters. In this analysis we had 166 observations (students) and estimated four parameters (the intercept and three slopes), giving 166 − 4 = 162.
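The degrees-of-freedom arithmetic can be checked directly on the fitted object (a minimal sketch, assuming my_model fitted above is still in the workspace):

```r
n <- nobs(my_model)          # 166 observations (students)
p <- length(coef(my_model))  # 4 estimated parameters: intercept + 3 slopes
n - p                        # 162 residual degrees of freedom
df.residual(my_model)        # the same value, extracted directly
```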
Multiple R-squared
The R-squared, also known as the coefficient of determination (\(R^2\)), gives the percentage of the variation in the dependent/response/target variable that the independent variables explain. In other words, it describes how well the model fits the data. \(R^2\) always lies between 0 and 1: values approaching 0 indicate a regression that does not explain the variance in the response variable well, while values closer to 1 indicate that most of the variance is explained. In my_model, the \(R^2\) is 0.2182, meaning that only about 22% of the variance in the response variable (exam points) is explained by the predictors attitude, stra and age. Adjusted R-squared reports the same quantity adjusted for the number of predictors in the model.
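The definition of \(R^2\) can be verified by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A small self-contained sketch with simulated data (the variable names are illustrative only):

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)                    # toy response
fit <- lm(y ~ x)
ss_res <- sum(residuals(fit)^2)           # unexplained variation
ss_tot <- sum((y - mean(y))^2)            # total variation
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```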
Three diagnostic plots were produced from the model: Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage.
Normal Q-Q plot
This plot explores the assumption that the errors of the model are normally distributed. When the majority of the residuals fall along the line, the assumption of normality is supported; severe deviation of the residuals from the line makes it questionable.
Residual vs Leverage plot
Leverage measures how much impact or influence a single observation has on the model. Not all outliers are influential in regression analysis. Influential cases are usually located in the upper right or lower right corner of the plot, and their presence can pull the regression line. When cases fall outside Cook's distance (i.e. they have high Cook's distance scores), they are influential, and the regression results would change if they were excluded.
par(mfrow= c(2,2))
plot(my_model, which = c(1,2,5))
In the Residuals vs Fitted plot, the residuals are scattered without a distinctive pattern, indicating that the linear relationship is well captured by the model.
In the Normal Q-Q plot, most of the residuals fall along the diagonal line, with only slight deviation at both ends. Since the deviation from the line is minimal, the Q-Q plot supports the normality assumption of the model.
In the leverage plot there are no influential cases, as all observations lie well inside the Cook's distance lines.
date()
## [1] "Sun Dec 05 17:17:08 2021"
References
1. R Markdown for Scientists
2. University of Virginia Library
In this exercise, we explore the use of logistic regression on student achievement data from secondary education in two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, collected using school reports and questionnaires. More information can be found in the link.
The purpose of the analysis is to study the relationship between high/low alcohol consumption and student attributes. The joined dataset used in this exercise combines the two student alcohol consumption data sets. The following adjustments have been made:
#set the work directory
setwd("C:/LocalData/ocholla/IODS-project")
#Load the data
alc<- read.csv("Data/pormath.csv", header = TRUE, stringsAsFactors = TRUE)
#print column names
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
#structure of the data
str(alc)
## 'data.frame': 370 obs. of 51 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
The dataset contains 370 observations of 51 variables; the variables are a mixture of integer, factor and logical data types.
Hypotheses on the four variables
Distribution of alcohol consumption by gender
library(ggplot2)
g1<- ggplot(data= alc, aes(x = high_use, fill= sex))
#define the plot as a bar plot and draw it
g1+ geom_bar()+ ggtitle("Alcohol consumption across students by gender")
A higher proportion of male students engage in high alcohol consumption than female students, whereas more female students fall into the low-consumption group than male students.
Relationship between sex, quality family relationships and use of alcohol
# summary statistics by sex and quality of family relationships
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#produce summary statistics by group
alc%>% group_by(sex, high_use)%>% summarise(count= n(), familytime=mean(famrel))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count familytime
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 154 3.92
## 2 F TRUE 41 3.73
## 3 M FALSE 105 4.13
## 4 M TRUE 70 3.79
Four-fifths of the female students have low alcohol consumption, and they have a higher mean quality of family relationship than the female students with high alcohol consumption. Similarly, male students with better family relationships tend to have low alcohol consumption.
Relationship between sex and incidences of going out with alcohol consumption among the students
# going out
alc%>% group_by(sex, high_use)%>% summarise(count= n(), going_out=mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count going_out
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 154 2.95
## 2 F TRUE 41 3.39
## 3 M FALSE 105 2.70
## 4 M TRUE 70 3.93
In both genders, higher incidences of going out are associated with high alcohol consumption.
Does high use of alcohol have a connection to study time?
#does high use of alcohol have a connection to study time?
a2<- ggplot(alc, aes(x= high_use, y = studytime, col= sex))
#define the plot as a boxplot and draw it
a2 + geom_boxplot()+ylab("Hours")+ggtitle("Student study hours by alcohol consumption and sex")
Among students with low alcohol consumption, the female students report spending 5 to 10 hours on their studies compared to male students, who report 2 to 5 hours.
On the contrary, female students with high alcohol consumption report around 5 study hours, while the male students report 2 to 5 hours.
#find the model with glm()
m<- glm(high_use~sex+studytime+famrel+goout, data = alc, family = "binomial")
#print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + studytime + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5891 -0.7629 -0.4976 0.8304 2.6937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2672 0.7312 -1.733 0.08309 .
## sexM 0.7959 0.2669 2.982 0.00286 **
## studytime -0.4814 0.1711 -2.814 0.00490 **
## famrel -0.4193 0.1399 -2.996 0.00273 **
## goout 0.7873 0.1230 6.399 1.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 370.69 on 365 degrees of freedom
## AIC: 380.69
##
## Number of Fisher Scoring iterations: 4
From the model, the p-values for all four variables are statistically significant. The output supports the hypotheses that gender, study time, going out and quality of family relationships affect students' alcohol consumption. The positive estimate for the factor variable sex (sexM) indicates higher odds of high alcohol use among male students relative to female students.
Model coefficients
#Print out the coefficients of the model
coef(m)
## (Intercept) sexM studytime famrel goout
## -1.2672175 0.7959494 -0.4813712 -0.4192830 0.7872896
Based on the regression coefficients, the odds of high alcohol use increase for male students and for those who go out frequently, while they decrease for students who allocate more study time and have good family relationships.
Odds ratio
The odds ratio is the ratio of expected "successes" to "failures". Higher odds correspond to a higher probability of success, with values ranging from zero to infinity.
The coefficients are transformed to odds ratios by exponentiating them; each odds ratio then gives the multiplicative change in the odds of high_use per one-unit change in the predictor variable, holding all other predictors constant.
#compute odds ratios (OR)
OR<- coef(m)%>% exp
#compute confidence intervals (CI)
CI<- confint(m)%>% exp
## Waiting for profiling to be done...
#Print out the odds ratios with their CI
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM 2.2165443 1.31841735 3.7630180
## studytime 0.6179355 0.43751933 0.8576353
## famrel 0.6575181 0.49791884 0.8636137
## goout 2.1974324 1.73873662 2.8198312
At the 95% confidence level, the odds of high alcohol use increase by a factor of about 2.22 for male students (relative to female students) and by a factor of about 2.20 per unit increase in going out. On the contrary, the odds of high alcohol use decrease by about 38% per added unit of study time and by about 34% per unit improvement in family relationship quality.
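The percentage changes above follow directly from exponentiating the log-odds coefficients; a quick sketch of the arithmetic (coefficient values copied from the model summary above):

```r
b_studytime <- -0.4814               # log-odds coefficient of studytime
b_famrel    <- -0.4193               # log-odds coefficient of famrel
# odds ratio = exp(coefficient); percent change in odds = (OR - 1) * 100
(exp(b_studytime) - 1) * 100         # about -38% per unit of study time
(exp(b_famrel) - 1) * 100            # about -34% per unit of famrel
```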
Predictive power of the model
We explore the predictive power of the logistic model through a 2x2 cross tabulation of the predictions versus the actual values.
#predict() the probability of high_use
probabilities <- predict(m, type = "response")
#add the predicted probabilities to alc
alc<- mutate(alc, probability= probabilities)
#use the probabilities to make a prediction of high_use
alc<- mutate(alc, prediction= probability>0.5)
# see the first five rows: predictors, actual class, predicted probabilities and predictions
select(alc, goout, famrel,studytime, sex, high_use,probability, prediction)%>% head(5)
## goout famrel studytime sex high_use probability prediction
## 1 2 3 4 F FALSE 0.05335420 FALSE
## 2 4 3 2 F TRUE 0.41613728 FALSE
## 3 1 4 1 F FALSE 0.06670564 FALSE
## 4 2 4 3 F FALSE 0.05657850 FALSE
## 5 1 4 3 F TRUE 0.02656662 FALSE
In line with the positive sexM coefficient, the predicted probabilities of high consumption for these five female students are all below 0.5, so all are predicted FALSE; two of them are nevertheless misclassified, as they actually report high use.
Confusion matrix (2x2) table
#tabulate the target variable versus the predictions
table(high_use=alc$high_use, prediction=alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 230 29
## TRUE 59 52
From the table, 230 low-consumption observations were correctly predicted while 29 were incorrectly predicted as high use. On the other hand, 52 high-consumption observations were correctly predicted while 59 were incorrectly predicted as low use.
Plotting the original high_use versus the predictions in alc
#initialize a plot of 'high_use' versus 'probability' in alc
g<- ggplot(alc, aes(x = probability, y= high_use, col = prediction))
#define the geom as points and draw the plot
g+geom_point()
Accuracy assessment through the confusion matrix
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table%>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.62162162 0.07837838 0.70000000
## TRUE 0.15945946 0.14054054 0.30000000
## Sum 0.78108108 0.21891892 1.00000000
From the proportion table, about 8% of all observations were false positives and about 16% were false negatives, so roughly 24% of the observations were misclassified by the model.
From the confusion matrix, the accuracy of the prediction is calculated by adding the true negatives and true positives (0.62 and 0.14) and dividing by the total of all cells. The prediction accuracy is about 0.76, which is better than a simple guessing strategy (always predicting low use would be correct 70% of the time).
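The same accuracy figure can be recomputed directly from the confusion-matrix counts:

```r
tn <- 230; fp <- 29                 # low use: correct / wrongly flagged as high
fn <- 59;  tp <- 52                 # high use: missed / correctly flagged
total <- tn + fp + fn + tp          # 370 students
(tn + tp) / total                   # accuracy, about 0.762
(tn + fp) / total                   # 0.7: baseline of always guessing FALSE
```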
Cross-validation is a method of testing a predictive model on unseen data: the data are split into training and testing sets, the training data are used to fit the model, and the test data are used to make predictions and evaluate performance.
One round of cross validation involves
This process is repeated so that eventually all of the data is used for both training and testing. In CV, the value of a penalty (loss) function (here the mean prediction error) is computed on data not used for fitting the model. A low CV value indicates a good model.
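One such round can be sketched by hand with a simple holdout split on simulated data (all names here are illustrative; the exercise below automates this over 10 folds with cv.glm()):

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))               # toy binary outcome
train <- sample(n, size = 0.8 * n)         # 80% of rows for fitting
fit <- glm(y ~ x, data = data.frame(x, y), subset = train,
           family = "binomial")
# predict on the held-out 20% and compute the mean prediction error
prob <- predict(fit, newdata = data.frame(x = x[-train]), type = "response")
mean(abs(y[-train] - prob) > 0.5)          # proportion of wrong predictions
```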
Cross-validation using K = 10 folds
library(boot)
#define a loss function (average prediction error)
loss_func<- function(class, prob){
n_wrong<-abs(class - prob)> 0.5
mean(n_wrong)
}
#average number of wrong prediction
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378
cv<- cv.glm(data= alc, cost= loss_func, glmfit= m, K=10)
#average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2432432
The model had a 10-fold CV prediction error of about 0.24, which is better than that of the model in the DataCamp exercise.
Super bonus: perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with fewer predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
#find the model with glm()
summary(m0<- glm(high_use~G3+absences+traveltime+reason+famsize+sex+studytime+famrel+goout, data = alc, family = "binomial"))
##
## Call:
## glm(formula = high_use ~ G3 + absences + traveltime + reason +
## famsize + sex + studytime + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8894 -0.7459 -0.4418 0.6426 2.7811
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.66560 1.00389 -2.655 0.00792 **
## G3 -0.01447 0.04251 -0.340 0.73350
## absences 0.07678 0.02339 3.282 0.00103 **
## traveltime 0.40320 0.19364 2.082 0.03732 *
## reasonhome 0.34770 0.34200 1.017 0.30930
## reasonother 1.11530 0.46975 2.374 0.01759 *
## reasonreputation -0.16462 0.36307 -0.453 0.65025
## famsizeLE3 0.23888 0.29336 0.814 0.41549
## sexM 0.87539 0.28243 3.100 0.00194 **
## studytime -0.30531 0.18219 -1.676 0.09379 .
## famrel -0.42737 0.14738 -2.900 0.00373 **
## goout 0.78725 0.13037 6.039 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 346.82 on 358 degrees of freedom
## AIC: 370.82
##
## Number of Fisher Scoring iterations: 5
summary(m1<- glm(high_use~absences+traveltime+reason+sex+studytime+famrel+goout, data = alc, family = "binomial"))
##
## Call:
## glm(formula = high_use ~ absences + traveltime + reason + sex +
## studytime + famrel + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9077 -0.7427 -0.4506 0.6514 2.7420
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74054 0.85222 -3.216 0.001301 **
## absences 0.07805 0.02326 3.356 0.000791 ***
## traveltime 0.41779 0.19046 2.194 0.028263 *
## reasonhome 0.32961 0.34063 0.968 0.333226
## reasonother 1.09017 0.46850 2.327 0.019970 *
## reasonreputation -0.18187 0.36148 -0.503 0.614869
## sexM 0.89451 0.28156 3.177 0.001488 **
## studytime -0.32030 0.18031 -1.776 0.075671 .
## famrel -0.43481 0.14668 -2.964 0.003033 **
## goout 0.79177 0.12892 6.142 8.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 347.54 on 360 degrees of freedom
## AIC: 367.54
##
## Number of Fisher Scoring iterations: 5
summary(m2<- glm(high_use~absences+traveltime+sex+studytime+famrel+goout, data = alc, family = "binomial"))
##
## Call:
## glm(formula = high_use ~ absences + traveltime + sex + studytime +
## famrel + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7951 -0.7373 -0.4716 0.6918 2.5876
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.35248 0.80844 -2.910 0.00362 **
## absences 0.07725 0.02257 3.423 0.00062 ***
## traveltime 0.37496 0.18585 2.018 0.04364 *
## sexM 0.89229 0.27716 3.219 0.00128 **
## studytime -0.38722 0.17561 -2.205 0.02745 *
## famrel -0.41690 0.14401 -2.895 0.00379 **
## goout 0.76216 0.12560 6.068 1.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 355.05 on 363 degrees of freedom
## AIC: 369.05
##
## Number of Fisher Scoring iterations: 4
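To finish the comparison, the training and testing errors of the three fitted models can be collected and plotted against model size. A sketch, assuming alc, loss_func and the models m0, m1 and m2 from above are in the workspace (the predictor count here is taken after dummy-variable expansion):

```r
library(boot)

models <- list(m0, m1, m2)
set.seed(123)
errors <- sapply(models, function(m) {
  c(train = loss_func(class = alc$high_use, prob = fitted(m)),
    test  = cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
})
n_pred <- sapply(models, function(m) length(coef(m)) - 1)

# plot both error curves against the number of predictors
matplot(n_pred, t(errors), type = "b", pch = 1:2, col = 1:2,
        xlab = "number of predictors", ylab = "prediction error")
legend("topright", legend = c("training", "testing"), pch = 1:2, col = 1:2)
```

Training error typically keeps falling as predictors are added, while the testing (CV) error flattens or rises once the extra predictors stop adding information.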
date()
## [1] "Sun Dec 05 17:17:11 2021"
This exercise utilizes the Boston data found in the MASS package. It includes 506 rows and 14 columns, describing housing in the suburbs of Boston.
The following variables are included:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
library(ggplot2)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The Boston data set has 506 observations of 14 variables, captured as either integer or numeric data types.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The summary shows the descriptive statistics of each variable in the Boston data. The variable tax has the largest range between the minimum and maximum values, while chas has the smallest.
#plot matrix of the variables
pairs(Boston[1:6])
pairs(Boston[7:14])
#between variables in the data
cor_matrix<- cor(Boston)%>% round(digits = 2)
#print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle",type= "upper",
cl.pos="b", tl.pos="d",tl.cex=0.6)
From the correlation plot and table, the variables tax and rad are highly positively correlated (r > 0.8), which is also visible in the visualization as the largest circle. The pairs medv and lstat, dis and age, and dis and nox show strong negative correlations.
Standardizing is done through scaling: the column mean is subtracted from each column, and the difference is divided by the column's standard deviation.
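This can be written out by hand to confirm that `scale()` does exactly that; a minimal sketch:

```r
x <- c(2, 4, 6, 8)

# subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() applies this to every column of a matrix
z_scale <- as.numeric(scale(x))

stopifnot(all.equal(z_manual, z_scale))
```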
#* center and standardize variables
boston_scaled<- scale(Boston)
#summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Create a new categorical crime-rate variable in the Boston data set from the scaled crime rate.
#class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
#change the object to data frame
boston_scaled<- as.data.frame(boston_scaled)
#summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
The scaled crime rate ranges from -0.419 (min) to 9.924 (max), with a mean of 0.
#create a quantile vector of crim and print it
bins<- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
The quantiles are then used as break points to divide the data into four categories of roughly equal size.
crime<- cut(boston_scaled$crim, breaks= bins, include.lowest= TRUE, labels=c("low","med_low","med_high","high"))
#look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
The low and high crime-rate categories have 127 observations each, while med_low and med_high have 126 each.
#remove the old crime variable from dataset
boston_scaled<- dplyr::select(boston_scaled, -crim)
#add the new categorical value to scaled data
boston_scaled<- data.frame(boston_scaled, crime)
Divide the dataset into train (80%) and test (20%) sets, then remove the crime variable from the test set, saving the correct classes for later comparison.
#number of rows in the Boston dataset
n<-nrow(boston_scaled)
#choose randomly 80% of the rows
ind<- sample(n, size=n * 0.8)
#create train set
train<- boston_scaled[ind,]
#create test set
test<- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes<- test$crime
#remove the crime variable from test data
test<- dplyr::select(test, -crime)
Linear discriminant analysis (LDA) is a classification (and dimension reduction) method. It finds the linear combinations of the variables that best separate the classes of the target variable. Here the categorical crime rate is the target variable, and all other variables in the data set are predictors.
#*Fit the linear discriminant analysis on the train set.
lda.fit<- lda(crime~., data= train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2376238 0.2524752 0.2698020
##
## Group means:
## zn indus chas nox rm age
## low 1.0041769 -0.9057235 -0.15056308 -0.8945285 0.4905586 -0.8702501
## med_low -0.0752177 -0.3173743 0.01475116 -0.5672132 -0.1137993 -0.3367108
## med_high -0.3846714 0.1594678 0.19085920 0.2774312 0.1634965 0.4116630
## high -0.4872402 1.0169738 -0.09172814 1.0474695 -0.4353468 0.7783116
## dis rad tax ptratio black lstat
## low 0.9081784 -0.6882425 -0.7364161 -0.45565185 0.3821828 -0.77534335
## med_low 0.3429964 -0.5559813 -0.4585888 -0.09782398 0.3107015 -0.11802936
## med_high -0.3211029 -0.4234010 -0.3265813 -0.15516592 0.1048981 -0.01445994
## high -0.8355810 1.6395837 1.5150965 0.78247128 -0.9165281 0.94151976
## medv
## low 0.5759520
## med_low 0.0402647
## med_high 0.1891620
## high -0.7407356
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0819344327 0.71881132 -0.99330500
## indus -0.0004878767 -0.23838425 -0.02438117
## chas -0.0782998113 -0.09905020 0.12063777
## nox 0.3999140865 -0.74081160 -1.32984027
## rm -0.1315869797 -0.10836974 -0.21224708
## age 0.2244498505 -0.32632845 -0.26296059
## dis -0.0406903566 -0.23002557 0.07381587
## rad 3.1781917316 0.94433632 -0.23529902
## tax -0.0137571035 0.03469068 0.94051141
## ptratio 0.0913498538 -0.09333122 -0.39605137
## black -0.1289828416 0.02238688 0.11555948
## lstat 0.2001340984 -0.20816258 0.42710659
## medv 0.1545296239 -0.35817192 -0.15009064
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9511 0.0373 0.0116
LDA determines the group means and computes, for each observation, the probability of belonging to each group of the target variable. The observation is then assigned to the group with the highest probability score.
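This assignment rule can be seen directly from the output of `predict()`; a sketch using the built-in iris data as a stand-in, since the train set above is a random sample and not reproducible here:

```r
library(MASS)

# fit an LDA model on a fully reproducible built-in dataset
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)

# posterior probabilities sum to 1 for every observation...
stopifnot(all(abs(rowSums(pred$posterior) - 1) < 1e-8))

# ...and the predicted class is the column with the highest posterior
argmax <- colnames(pred$posterior)[max.col(pred$posterior)]
stopifnot(all(argmax == as.character(pred$class)))
```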
Interpreting the results
Prior probabilities of groups: the proportion of training observations in each group.
Group means: the centre of gravity of each group, i.e. the mean of each variable within each group.
Coefficients of linear discriminants: the linear combinations of the predictor variables that form the LDA decision rule.
Proportion of trace: the share of between-group variance explained by each linear discriminant; LD1 explains the most at 95.11% and LD3 the least at 1.16%.
An LDA biplot is designed to show how individuals and groups differ.
#create function for lda biplot arrows
lda.arrows<- function(x, myscale=1, arrow_heads=0.1, color="orange",
tex= 0.75, choices=c(1,2)){
heads<- coef(x)
arrows(x0 =0, y0=0,
x1= myscale * heads[,choices[1]],
y1= myscale* heads[,choices[2]], col = color, length=
arrow_heads)
text(myscale*heads[,choices], labels=row.names(heads),
cex= tex, col=color, pos=3)
}
#target classes as numeric
classes<- as.numeric(train$crime)
#plot the lda results
plot(lda.fit, dimen=2, col=classes, pch= classes)
lda.arrows(lda.fit, myscale=1)
In an LDA biplot the arrows represent the variables, and longer arrows indicate more discriminating power. In the plot above, the rad variable has the longest arrow, which makes it the most influential variable in the model.
In addition, the variables nox and rad are not highly correlated, as the angle between them is nearly a right angle, while nox and zn are negatively correlated.
lda.pred<- predict(lda.fit, newdata=test)
#* cross tabulate the results with the crime categories from the test set.
table(correct= correct_classes, predicted= lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 13 2 0
## med_low 4 18 8 0
## med_high 0 4 19 1
## high 0 0 0 18
The model classified the high category perfectly (18/18 correct) and performed well on med_high (19/24), while the low and med_low categories were more often confused with their neighbouring classes. Note that the train/test split is random, so the exact counts vary between runs.
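The overall accuracy can be computed from the cross-tabulation as the share of observations on the diagonal; a sketch using the counts reported above (which, again, vary between runs because the split is random):

```r
# confusion matrix with rows = correct classes, columns = predictions
tab <- matrix(c(15,  4,  0,  0,   # predicted low
                13, 18,  4,  0,   # predicted med_low
                 2,  8, 19,  0,   # predicted med_high
                 0,  0,  1, 18),  # predicted high
              nrow = 4,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# correct predictions sit on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  # 0.686 for these counts
```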
Reload the Boston dataset and standardize it.
library(MASS)
data("Boston")
#center and standardize variables
Boston_scaled<- scale(Boston)
#summaries of the scaled variables
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
From the summary, all the variables have a mean of zero after standardization.
#euclidean distance matrix
dist_eu<- dist(Boston_scaled)
#summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#manhattan distance matrix
dist_man<- dist(Boston_scaled, method = 'manhattan')
#summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
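The two distance measures are easy to verify by hand against `dist()`; a minimal sketch:

```r
a <- c(1, 2)
b <- c(4, 6)

# euclidean: square root of the sum of squared differences
d_eu <- sqrt(sum((a - b)^2))
# manhattan: sum of absolute differences
d_man <- sum(abs(a - b))

stopifnot(d_eu == 5, d_man == 7)
stopifnot(all.equal(as.numeric(dist(rbind(a, b))), d_eu),
          all.equal(as.numeric(dist(rbind(a, b), method = "manhattan")), d_man))
```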
Implement k-means clustering with an arbitrary number of clusters (here k = 3).
#K-means clustering
km<- kmeans(Boston_scaled, centers = 3)
#Plot the Boston_scaled dataset with clusters
pairs(Boston_scaled, col= km$cluster)
In each scatter plot the observations are coloured by their assigned cluster, partitioning the data into three groups.
To determine the optimal number of clusters, we look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. When the number of clusters is plotted against the total WCSS, the optimal number is where the total WCSS drops sharply.
#set set.seed function to avoid the random assigning of initial clusters centers
set.seed(123) #to avoid random
#determine the number of clusters
k_max<-10
#calculate the total within sum of squares
twcss<- sapply(1:k_max, function(k)
{kmeans(Boston_scaled, k)$tot.withinss})
#visualize the results
qplot(x = 1:k_max, y= twcss, geom= 'line')
From the plot, the sharpest drop in the total WCSS occurs at 2 clusters, so we take k = 2 as the optimal number.
Implement the Kmeans with the optimal number of centers = 2
#k-means clustering
km<- kmeans(Boston_scaled, centers= 2)
#plot the Boston dataset with clusters, vary the number of pairs
pairs(Boston_scaled,col = km$cluster)
From the plot, the two clusters are clearly distinct from each other, reflecting the k-means objective of forming compact, well-separated clusters.
Bonus:
#Reload the Boston dataset and standardize it
library(MASS)
data("Boston")
#center and standardize variables
boston_scaled2<- scale(Boston)
#summaries of the scaled variables
summary(boston_scaled2)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling, all the variables have a mean of zero.
#k-means clustering
km<- kmeans(boston_scaled2, centers= 5)
#plot the Boston dataset with clusters, vary the number of pairs
pairs(boston_scaled2,col = km$cluster)
Each scatter plot now shows the observations partitioned into five clusters.
#add the clusters into the boston dataset
boston_scaled2<- data.frame(boston_scaled2, km$cluster)
#number of rows in the Boston dataset
n<-nrow(boston_scaled2)
#choose randomly 80% of the rows
ind<- sample(n, size=n * 0.8)
#create train set
train2<- boston_scaled2[ind,]
#create test set
test2<- boston_scaled2[-ind,]
The original data was divided into 80% training and 20% testing sets.
#*Fit the linear discriminant analysis on the train set.
lda.fit2<- lda(km.cluster~., data= train2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km.cluster ~ ., data = train2)
##
## Prior probabilities of groups:
## 1 2 3 4 5
## 0.07425743 0.13613861 0.23267327 0.20792079 0.34900990
##
## Group means:
## crim zn indus chas nox rm age
## 1 -0.1854493 -0.3157316 0.3133786 3.6647712 0.4306404 0.2138181 0.4290913
## 2 -0.4138882 2.2993839 -1.1620998 -0.2007454 -1.1957952 0.6874068 -1.4644349
## 3 1.2068757 -0.4872402 1.0299433 -0.2723291 0.9969853 -0.4856256 0.7840759
## 4 -0.3322134 -0.4808597 0.5005827 -0.2723291 0.4126493 -0.5353672 0.6272386
## 5 -0.3982818 -0.1104688 -0.6999989 -0.2723291 -0.6224712 0.3392337 -0.4783739
## dis rad tax ptratio black lstat
## 1 -0.4433830 0.009638649 -0.0682569 -0.4290486 0.14108828 -0.1081627
## 2 1.6196570 -0.679093526 -0.5526750 -0.8218082 0.34273326 -0.9422548
## 3 -0.8263352 1.635167428 1.5322534 0.8052870 -0.80749211 0.9509017
## 4 -0.5482077 -0.585376631 -0.1979161 0.1772572 0.05114137 0.4788749
## 5 0.4272987 -0.554250499 -0.7413582 -0.3548817 0.37017778 -0.5635095
## medv
## 1 0.5694394
## 2 0.7486460
## 3 -0.8219329
## 4 -0.4545371
## 5 0.4426882
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## crim 0.034626545 0.011867261 -0.1559040056 0.11802273
## zn 0.297786175 -0.287582562 -1.4864689354 -1.24084138
## indus -0.054367812 0.370023831 0.1632260886 -0.40109753
## chas -5.190618709 -0.398296510 -0.2982405218 -0.03794020
## nox 0.005563069 -0.287697635 0.0873598437 -0.49971441
## rm 0.055060927 0.015978997 -0.1245985638 0.09946824
## age -0.123498535 0.189920047 0.4227915007 -0.28603098
## dis -0.006874898 -0.462773844 -0.4034456808 0.33518750
## rad -0.304219079 2.649304175 -1.0824093907 2.02998628
## tax -0.017448586 -0.004645183 -0.6445853812 -1.10990584
## ptratio -0.016550862 0.029361089 0.1926290390 -0.61540473
## black -0.002242608 -0.096184981 0.0315333992 0.09825891
## lstat 0.082731274 0.294123773 -0.0028274113 -0.26817140
## medv 0.146425853 -0.308358740 -0.0006178331 -0.03878442
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.6388 0.2588 0.0818 0.0206
Interpreting the results
Prior probabilities of groups: the proportion of training observations in each group.
Group means: the centre of gravity of each group, i.e. the mean of each variable within each group.
Coefficients of linear discriminants: the linear combinations of the predictor variables that form the LDA decision rule.
Proportion of trace: the share of between-group variance explained by each linear discriminant; LD1 explains the most at 63.88% and LD4 the least at 2.06%.
Draw the LDA (bi)plot
#create function for lda biplot arrows
lda.arrows<- function(x, myscale=1, arrow_heads=0.1, color="orange",
tex= 0.75, choices=c(1,2)){
heads<- coef(x)
arrows(x0 =0, y0=0,
x1= myscale * heads[,choices[1]],
y1= myscale* heads[,choices[2]], col = color, length=
arrow_heads)
text(myscale*heads[,choices], labels=row.names(heads),
cex= tex, col=color, pos=3)
}
#target classes as numeric
classes<- as.numeric(train2$km.cluster)
#plot the lda results
plot(lda.fit2, dimen=2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale=1)
From the plot, the chas variable has the longest arrow and is therefore the most influential variable, followed by rad. Moreover, chas and rad are not highly correlated, as the angle between them is nearly a right angle.
model_predictors<- dplyr::select(train,-crime)
#check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
#matrix multiplication
matrix_product<- as.matrix(model_predictors)%*%
lda.fit$scaling
matrix_product<-as.data.frame(matrix_product)
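The matrix product above reproduces, up to a constant shift, the discriminant scores that `predict()` returns: `predict.lda` first centers the predictors at the prior-weighted group means before projecting onto the scaling matrix. A sketch with the built-in iris data as a stand-in (the train set above is a random sample, so it is not reproducible here):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)

# predict.lda centers at the prior-weighted group means, then projects
mu <- colSums(fit$prior * fit$means)
scores <- scale(as.matrix(iris[, 1:4]), center = mu, scale = FALSE) %*% fit$scaling

stopifnot(all.equal(unname(scores), unname(pred$x)))
```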
Access the plotly package to create a 3D plot of the columns of the matrix product.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x= matrix_product$LD1, y= matrix_product$LD2,
z= matrix_product$LD3, type='scatter3d', mode= 'markers')
date()
## [1] "Sun Dec 05 17:17:26 2021"
The human dataset combines several United Nations Development Programme (UNDP) indicators from most countries in the world. The variables fall under two categories: health and knowledge, and empowerment.
#set the workdirectory
setwd("C:/LocalData/ocholla/IODS-project/Data")
library(dplyr)
library(corrplot)
library(ggplot2)
#Load the human data
human<- read.csv("human.csv", header = TRUE)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The dataset contains 155 observations of 8 variables, namely Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth and Parli.F.
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 1.0 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 39.5 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 78.0 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 78.0 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.:116.5 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :155.0 Max. :1100.0 Max. :204.80 Max. :57.50
The variables are on very different scales; for example, Mat.Mor ranges up to 1100 while Edu2.FM stays below 1.5.
#plot matrix of the variables
pairs(human)
#create a correlation matrix to show the correlation
#between variables in the data
cor_matrix<- cor(human) %>% round(digits = 2)
#print the correlation matrix
cor_matrix
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM 1.00 0.01 0.59 0.58 0.18 -0.66 -0.53 0.08
## Labo.FM 0.01 1.00 0.05 -0.14 -0.01 0.24 0.12 0.25
## Edu.Exp 0.59 0.05 1.00 0.79 0.12 -0.74 -0.70 0.21
## Life.Exp 0.58 -0.14 0.79 1.00 0.19 -0.86 -0.73 0.17
## GNI 0.18 -0.01 0.12 0.19 1.00 -0.10 -0.14 0.01
## Mat.Mor -0.66 0.24 -0.74 -0.86 -0.10 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.70 -0.73 -0.14 0.76 1.00 -0.07
## Parli.F 0.08 0.25 0.21 0.17 0.01 -0.09 -0.07 1.00
From the matrix, the strongest negative correlation is between Life.Exp and Mat.Mor (-0.86), while Mat.Mor and Ado.Birth (0.76) and Edu.Exp and Life.Exp (0.79) show strong positive correlations.
#visualize the correlation matrix
corrplot(cor_matrix, method="circle",type= "upper",
cl.pos="b", tl.pos="d",tl.cex=0.6)
The visualization shows the same pattern: maternal mortality and adolescent birth rate correlate negatively with the education and life-expectancy variables, while Labo.FM and Parli.F correlate only weakly with the rest.
Principal component analysis (PCA) is used to summarize and visualize the information in a data set containing observations described by multiple inter-correlated quantitative variables.
PCA is used to extract the important information from multivariate data table and to express this information as set of few new variables called principal components. These new variables correspond to linear combination of the originals. The number of principal components is less than or equal to the number of original variables.
The main purposes of principal component analysis are to reduce the dimensionality of a multivariate data set and to identify the directions (principal components) along which the variation in the data is largest.
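Mathematically, the principal components are the eigenvectors of the covariance matrix of the centered data, and the variances they explain are the corresponding eigenvalues; a minimal sketch:

```r
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)

p <- prcomp(X)  # centers the columns by default

# the PC variances equal the eigenvalues of the covariance matrix
ev <- eigen(cov(X))$values
stopifnot(all.equal(p$sdev^2, ev))

# each component is a linear combination of the original variables,
# and the rotation (loadings) matrix is orthonormal
stopifnot(all.equal(crossprod(p$rotation), diag(3), check.attributes = FALSE))
```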
pca_human<- prcomp(human)
#print summary of the pca_human
s<-summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 214.2937 44.75162 26.34667 11.4791 4.06656 1.60664
## Proportion of Variance 0.9416 0.04106 0.01423 0.0027 0.00034 0.00005
## Cumulative Proportion 0.9416 0.98267 0.99690 0.9996 0.99995 1.00000
## PC7 PC8
## Standard deviation 0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000
PC1 has the highest proportion of variance at 0.9416, while PC7 and PC8 explain essentially none.
PC1, PC2 and PC3 together account for 99.69% of the cumulative variability; the variance explained decreases as the component number increases.
#round the percentage of variance captured by each PC
pca_pr<-round(100*s$importance[2,],
digits = 1)
#print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 94.2 4.1 1.4 0.3 0.0 0.0 0.0 0.0
Rounded to one decimal, PC1 alone captures 94.2% of the variance.
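The same percentages can be computed directly from the standard deviations stored in the prcomp object; a minimal sketch:

```r
set.seed(42)
X <- matrix(rnorm(60), ncol = 3)
p <- prcomp(X)

# proportion of variance = squared sdev of each PC over the total
prop_var <- p$sdev^2 / sum(p$sdev^2)

stopifnot(all.equal(unname(summary(p)$importance["Proportion of Variance", ]),
                    prop_var, tolerance = 1e-4))
stopifnot(abs(sum(prop_var) - 1) < 1e-12)
```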
#create object pc_lab to be used as axis labels
pc_lab<- paste0(names(pca_pr), "(",pca_pr,"%)")
biplot(pca_human, choices = 1:2, cex = c(0.8,1), col = c("grey40","deeppink2"),
xlab=pc_lab[1],ylab= pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
From the principal component analysis of the unstandardized data, PC1 explains 94.2% of the variance and PC2 only 4.1%, meaning that almost all of the variability in the original features is captured by PC1.
Maternal mortality is the most influential variable on PC1, whereas on PC2 it is GNI; the two variables are essentially uncorrelated, as the angle between their arrows is almost a right angle. In addition, maternal mortality has the highest standard deviation in the unscaled data and is therefore the most influential variable overall, as shown by its having by far the longest arrow.
human_std<- scale(human)
#check the summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-1.7154 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.8577 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median : 0.0000 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8577 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 1.7154 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
The means of all the variables have been scaled to zero.
pca_human_std<-prcomp(human_std)
#print summary of the pca_human
s1<-summary(pca_human_std)
s1
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9659 1.1387 0.9896 0.8662 0.69949 0.54002 0.46700
## Proportion of Variance 0.4831 0.1621 0.1224 0.0938 0.06116 0.03645 0.02726
## Cumulative Proportion 0.4831 0.6452 0.7676 0.8614 0.92254 0.95899 0.98625
## PC8
## Standard deviation 0.33165
## Proportion of Variance 0.01375
## Cumulative Proportion 1.00000
PC1 has the highest proportion of variance at 0.483, compared to 0.014 for PC8.
#round the percentage of variance captured by each PC
pca_pr1<-round(100*s1$importance[2,],
digits = 1)
#print out the percentages of variance
pca_pr1
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 48.3 16.2 12.2 9.4 6.1 3.6 2.7 1.4
The percentages of variance of the components, with PC1 highest at 48.3% and PC8 lowest at 1.4%.
#create object pc_lab to be used as axis labels
pc_lab1<- paste0(names(pca_pr1), "(",pca_pr1,"%)")
#draw biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8,1), col = c("grey40","deeppink2"),
xlab=pc_lab1[1],ylab= pc_lab1[2])
From the plot, the arrows are now of comparable length, so several variables contribute meaningfully to the first two components.
Are the results different? Yes, the results are different.
Why or why not? Unstandardized variables have very different variances, so PCA on unstandardized data is dominated by the variables with the largest scales and is sensitive to outliers. This is demonstrated by the single long arrow for Mat.Mor in the unstandardized PCA. Standardized variables, by contrast, each have a mean of zero and a standard deviation of one, so no variable dominates purely because of its scale and the variables have roughly equal influence. This is visible in the biplot, where the arrows are of similar length across the components.
Multiple correspondence analysis (MCA) is a dimensionality reduction method used to analyse the patterns of relationships between several categorical variables.
The goal of MCA is to identify associations between variable categories.
Using dataset tea from the FactoMineR package.
The tea data consist of a survey of tea consumers about their consumption of tea. The questions cover how they consume tea, what they think of tea, and descriptive information (sex, age, socio-professional category and sport practice).
#*Load the tea dataset from the package Factominer
library(FactoMineR)
library(ggplot2)
library(tidyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data("tea")
#*Look at the structure and dimensions of the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The dataset contains 300 observations of 36 variables; except for age, all the variables are categorical.
#columns to keep in the dataset
keep_columns<- c("Tea","How","how","sugar","where","lunch")
#select the keep_columns to create a new dataset
tea_time<- dplyr::select(tea, one_of(keep_columns))
#look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
Each of the selected columns has two to four factor levels.
Plot the frequency of the variable categories
gather(tea_time)%>% ggplot(aes(value))+
facet_wrap("key", scales= "free")+
geom_bar()+theme(axis.text.x =
element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
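The attribute warning comes from gather() stacking factors whose level sets differ. gather() is superseded in tidyr; a sketch of the same plot with pivot_longer(), converting the factors to character first so their attributes no longer conflict (assuming tidyr >= 1.0.0 is available):

```r
library(tidyr)
library(dplyr)
library(ggplot2)

tea_time %>%
  mutate(across(everything(), as.character)) %>%  # drop factor attributes up front
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```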
From the plots, Earl Grey is the most popular tea, and most respondents drink it alone, brewed from a tea bag, bought from a chain store, and not at lunch; sugar use is almost evenly split.
#Multiple Correspondence Analysis
mca<- MCA(tea_time, graph = FALSE)
#summary
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
The eigenvalues show the variance and the percentage of variance retained by each dimension. Dimension 1 retains the most variance (15.24%) while Dimension 11 retains the least (3.39%).
Individuals: shows, for the first ten individuals, their coordinates, their contribution (%) to each dimension, and the cos2 (the squared correlation) on each dimension.
Categories: shows the coordinates of the variable categories, their contributions (%), the cos2 (the squared correlations) and the v.test value. The v.test follows a normal distribution: if the value lies outside \(\pm\) 1.96, the coordinate is significantly different from zero.
Categorical variables (eta2): shows the squared correlation between each variable and the dimensions. A value close to one indicates a strong link between the variable and the dimension.
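The same quantities can also be pulled directly from the MCA result object rather than read off summary(); a short sketch using the FactoMineR result slots listed by print(mca):

```r
round(mca$eig, 3)                 # eigenvalues, % of variance, cumulative %
round(head(mca$ind$coord, 5), 3)  # coordinates of the first individuals
round(mca$var$v.test, 2)          # v.test values of the categories
round(mca$var$eta2, 3)            # squared correlations per variable
```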
#print the output of the MCA() function
print(mca)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li" "weights of rows"
#Visualize the proportion of variance retained by different dimensions
eig.val<- get_eigenvalue(mca)
head(eig.val)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.2793712 15.238428 15.23843
## Dim.2 0.2609265 14.232352 29.47078
## Dim.3 0.2193358 11.963768 41.43455
## Dim.4 0.1894379 10.332978 51.76753
## Dim.5 0.1772231 9.666715 61.43424
## Dim.6 0.1561774 8.518770 69.95301
Dimension 1 has the highest eigenvalue (0.279), retaining 15.24% of the variance. The percentage of variance decreases with each successive dimension, while the cumulative percentage increases.
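The cumulative column answers the practical question of how many dimensions to keep; a sketch, using a hypothetical 70% retention threshold:

```r
eig.val <- get_eigenvalue(mca)
# first dimension at which the cumulative variance passes 70%
n_dims <- which(eig.val[, "cumulative.variance.percent"] >= 70)[1]
n_dims  # here the first 7 dimensions retain about 77.8% of the variance
```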
fviz_screeplot(mca, addlabels = TRUE, ylim=c(0,45))
The scree plot displays how the percentage of variance is distributed across the dimensions: it declines as the dimension number increases.
fviz_mca_biplot(mca,
repel = TRUE,
ggtheme = theme_minimal())
## Warning: ggrepel: 244 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The plot shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.
The distance between any two row points or column points gives a measure of their similarity (or dissimilarity). Categories such as tea bag, sugar, chain store and alone are similar, as shown by the short distances between them, in contrast to categories such as green, tea shop and other.
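To see whether such groupings of individuals follow one of the variables, the individuals can be colored by a chosen category; a sketch with the factoextra functions used above, grouping by where the tea is bought:

```r
fviz_mca_ind(mca,
             habillage = "where",   # color individuals by purchase place
             addEllipses = TRUE,    # concentration ellipse per group
             repel = TRUE,
             ggtheme = theme_minimal())
```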
fviz_mca_var(mca, choice = "mca.cor",
repel = TRUE,
ggtheme = theme_minimal())
The plot helps identify the variables most correlated with each dimension. From the plot, how and where are the most strongly correlated with both Dimension 1 and Dimension 2, in line with their eta2 values above, while lunch is the least correlated with Dimension 1.
#display the coordinates of the variable categories in each dimension
#var <- get_mca_var(mca)
#head(round(var$coord, 2), 4)
#visualize only variables
fviz_mca_var(mca, col.var="black", shape.var = 15,
repel = TRUE)
From the plot, categories such as unpackaged and tea shop lie far from the origin on the first two dimensions, while categories such as alone and milk sit close to it.
#visualize MCA
plot(mca, invisible=c("ind"), habillage="quali", graph.type="classic")
The MCA factor map shows the distribution of the variable categories across the two dimensions and how far each lies from the origin. Categories close to the origin have little influence and contribute little to the variability of the dimensions.
The categories furthest from the origin contribute the most to the definition of the dimensions; these are the most important in explaining the variability in the dataset.
fviz_contrib(mca, choice= "var", axes=1, top= 20) # top 20 variable categories
In Dimension 1, tea shop, unpackaged, tea bag and chain store are the most important categories, with tea shop contributing approximately 25% of the variability of Dimension 1.
fviz_contrib(mca, choice= "var", axes = 2, top = 20)
In Dimension 2, chain store+tea shop is the most important category, contributing almost 30% of the variability of Dimension 2. Other important categories include tea bag+unpackaged, unpackaged and other.
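Contribution and representation quality can also be read from a single map; a sketch coloring the categories by their cos2 on the first two dimensions, using factoextra's gradient coloring:

```r
fviz_mca_var(mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             ggtheme = theme_minimal())
```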
date()
## [1] "Sun Dec 05 17:17:33 2021"
(more chapters to be added similarly as we proceed with the course!)